Producing charts for ‘The Changing Fortunes of the Richest Countries in Grand Slam Tennis’
By: Dr. Chris Martin
Tools / packages used: R, R Markdown,
ggplot2, tidyverse (inc. dplyr and
tidyr), plotly.
Techniques used: exploratory data analysis,
functional programming (purrr package), data visualisation,
data cleaning/reshaping/manipulation.
Chart types used: area chart, line chart, bar chart,
stacked bar chart, small multiples, heatmap, ridge chart, interactive
charts (with ggplotly).
Source data: To produce the charts, I needed data on the women’s and men’s singles entrants for each Grand Slam tournament since 1990. This came from the excellent Tennis Abstract.
This notebook produces the static data visualisations which features in my data storytelling project: The Changing Fortunes of the Richest Countries in Grand Slam Tennis. You can read the full story on my website.
A note on my data visualistion workflow
The chart produced in this notebook are ‘skeletons’ with fairly minimal styling, but all the key structural components in places. The chart are exported from this notebook as svgs. These can are then editted - adding textures, photos, annotations etc. - using graphic design software to create the final versions.
Setting up the notebook
# import packages
library(tidyverse) # for data manipulation and viz
library(knitr) # for formatting tables
library(kableExtra) # for formatting tables
# set default theme for exploratory plots
theme_set(theme_light()) # using a minimal theme to make it easier to edit
# the plots in graphic design software later on
# set default R markdown chunk options
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
# output the first lines of a dataframe in a nice format
scrollable_table <- function(df){
df %>%
kable("html") %>%
kable_styling() %>%
scroll_box(width = "100%", height = "200px",
fixed_thead = list(enabled = T, background = "grey90"))
}Reading in the data
The data is read in from csvs produced in
data_clean.Rmd, a lot of the data preparation was done in
that notebook.
gs_first_round_gdp <- read_csv("../data/results_gdp.csv") %>%
#just look up until covid pandemic (as will have distorted things)
filter(year < 2020)
# check data looks as expected
gs_first_round_gdp %>%
head() %>%
kable()| year | tourney_name | tour | name | id | ioc | country | gdp_per_capita | iso |
|---|---|---|---|---|---|---|---|---|
| 1990 | Australian Open | atp | Jim Pugh | 101004 | USA | United States | 40436.94 | USA |
| 1990 | Australian Open | atp | Ivan Lendl | 100656 | USA | United States | 40436.94 | USA |
| 1990 | Australian Open | atp | Cyril Suk | 101327 | CZE | Czechia | 23585.18 | CZE |
| 1990 | Australian Open | atp | Tomas Carbonell | 101507 | ESP | Spain | 27543.92 | ESP |
| 1990 | Australian Open | atp | Michael Brown B395 | 101895 | AUS | Australia | 31016.42 | AUS |
| 1990 | Australian Open | atp | Karel Novacek | 101120 | CZE | Czechia | 23585.18 | CZE |
gs_entries_by_country <- read_csv("../data/gs_entries_by_country.csv") %>%
# just look up until covid pandemic (as will have distorted things)
filter(year < 2020)
# check data looks as expected
gs_entries_by_country %>%
head() %>%
kable()| year | tourney_name | country_code | country | gdp_per_capita | num_first_rd | income_decile | top_20_perc |
|---|---|---|---|---|---|---|---|
| 1990 | Australian Open | USA | United States | 40436.94 | 55 | 10 | TRUE |
| 1990 | Australian Open | AUS | Australia | 31016.42 | 36 | 9 | TRUE |
| 1990 | Australian Open | FRA | France | 33732.02 | 20 | 9 | TRUE |
| 1990 | Australian Open | GER | Germany | 36699.48 | 20 | 9 | TRUE |
| 1990 | Australian Open | SWE | Sweden | 34156.82 | 14 | 9 | TRUE |
| 1990 | Australian Open | CZE | Czechia | 23585.18 | 10 | 8 | FALSE |
How have the richest 20% of countries performed at Grand Slams (1990 - 2019)?
This section focuses on how well the richest 20% of countries performed at Grand Slams (1990 - 2019). The metric used for a country’s performance is how many players they had appearing in the first round of Grand Slam tennis tournaments.
The overall trend
The performance of the richest countries declined 1900 to 2008, and then picked up again a little.
# ----------------------------------------------------------------------------
# Reshape data for plotting
# ----------------------------------------------------------------------------
plot_df <- gs_entries_by_country %>%
# looked performance for two groups of countries
# top 20% richest countries and the rest
mutate(top_20_perc = if_else(is.na(top_20_perc), FALSE, top_20_perc)) %>%
group_by(year, top_20_perc) %>%
summarise(num_first_rd = sum(num_first_rd)) %>%
ungroup() %>%
# calculate proportions from counts
group_by(year) %>%
mutate(perc_first_round = num_first_rd / sum(num_first_rd)) %>%
ungroup()
# ----------------------------------------------------------------------------
# Produce the plot
# ----------------------------------------------------------------------------
p <- ggplot(plot_df) +
# core chart
geom_area(aes(year, perc_first_round, fill = top_20_perc)) +
# format axis
scale_y_continuous(labels = scales::percent_format(accuracy = 1L),
expand = c(0,0)) +
scale_x_continuous(expand = c(0,0),
breaks=c(1990,1995,2000,2005,2010,2015,2019)) +
coord_cartesian(clip = 'off') +
# tidy up presentation
labs(x = NULL, y = NULL) +
theme(legend.position = "none",
plot.margin = margin(25,25,25,25))
p # ----------------------------------------------------------------------------
# Export the plot for editing
# ----------------------------------------------------------------------------
ggsave("../images/all_gs.svg", units = "mm", width = 600, height = 325)Is the overall trend driven by the composition of 20% richest countries changing over time?
When I saw showing early versions of the visualisation to members of the target audience, a question came up. Are the trends indentified a result of the countries in the top 20% richest changing? In this section of the notebook I reshape so I can produce a chart that shows that this isn’t drivnig the trend.
# ----------------------------------------------------------------------------
# Reshape data for plotting
# ----------------------------------------------------------------------------
# count number of first round appearances per country per year
top_twenty_countries <- gs_entries_by_country %>%
filter(top_20_perc) %>%
count(year, country, wt = num_first_rd) %>%
arrange(year, desc(n)) %>%
group_by(country) %>%
mutate(country_ave_n = mean(n)) %>%
ungroup()
# average number of appearances per country each year (1990 - 2019)
country_ave_ns <- top_twenty_countries %>%
distinct(country, country_ave_n)
# Count how many time the countries appear in the top twenty percent
top_twenty_counts <- top_twenty_countries %>%
count(country) %>%
left_join(country_ave_ns) %>%
arrange(desc(n), desc(country_ave_n)) %>%
rename(years_top_20 = n)
# this will be useful later for ordering the countriesin the plot
levels <- top_twenty_counts$country
# create a grid to see where countries have moved out of top 20 perc
grid <- expand_grid(year = unique(top_twenty_countries$year),
country = unique(top_twenty_countries$country))
# identify implicitly missing data
plot_df <- grid %>%
left_join(top_twenty_countries)
# identify countries in bottom 80%
bottom_80_countries_by_year <- gs_entries_by_country %>%
distinct(year, country, income_decile) %>%
filter(income_decile < 9)
# function for checking if a country is the bottom 80% in a given year
check_bottom_80 <- function(year, country){
selector <- bottom_80_countries_by_year$year == year &
bottom_80_countries_by_year$country == country
res <- bottom_80_countries_by_year[selector, ]
if(nrow(res) == 0){
return(FALSE)
}
else {
return(res[[1,"income_decile"]] < 9)
}
}
# quick test of check_bottom_80()
check_bottom_80(1990, "Nigeria")## [1] TRUE
# update plotting dataframe with variable recording if a country is the bottom 80%
# in a given year
plot_df_1 <- plot_df %>%
mutate(in_bottom_80 = map2_lgl(year, country, ~check_bottom_80(.x,.y)),
n = if_else(is.na(n) & in_bottom_80,
-1, n),
n = replace_na(n, 0),
bin_n = cut(n, breaks = c(-Inf, -1e-10,0,1e10, 10, 50, 100, Inf)))
# ----------------------------------------------------------------------------
# Produce the plot
# ----------------------------------------------------------------------------
p <- ggplot(plot_df_1) +
# core plot
geom_tile(aes(x = year,
y = factor(country, levels = rev(levels)),
fill = bin_n),
colour = "#F8F7F7") +
# plot text
labs(x= NULL, y = NULL) +
guides(fill = guide_legend(reverse=TRUE)) +
# colours
scale_fill_manual(values = c("#E7E4E5", "white", '#cfcde7',
'#9f9dce', '#6e70b6', '#35469d')) +
# simplify plot for editting
scale_x_continuous(position = "top") +
coord_equal() +
theme_minimal()
p# output plot for editing
ggsave("../images/image_4.svg")
# a quick test to confirm that the plot makes sense
gs_entries_by_country %>%
filter(country == "Israel") %>%
distinct(year, income_decile) %>%
scrollable_table()| year | income_decile |
|---|---|
| 1990 | 8 |
| 1991 | 8 |
| 1992 | 8 |
| 1993 | 8 |
| 1994 | 8 |
| 1995 | 8 |
| 1996 | 8 |
| 1997 | 8 |
| 1998 | 8 |
| 1999 | 8 |
| 2000 | 8 |
| 2001 | 8 |
| 2002 | 8 |
| 2003 | 8 |
| 2004 | 8 |
| 2005 | 8 |
| 2006 | 8 |
| 2007 | 8 |
| 2008 | 8 |
| 2009 | 8 |
| 2010 | 8 |
| 2011 | 8 |
| 2012 | 8 |
| 2013 | 9 |
| 2014 | 9 |
| 2015 | 9 |
| 2016 | 9 |
| 2017 | 9 |
| 2018 | 9 |
The decline of the richest countries
Having established the overall trend is not an artifact of countries moving in and out the richest 20%, this section looks in detail at decline of the richest countries performance 1990-2008.
# ----------------------------------------------------------------------------
# Reshape data for plotting
# ----------------------------------------------------------------------------
# simplify representation of missing data
gs_entries_by_country_clean <- gs_entries_by_country %>%
mutate(top_20_perc = if_else(is.na(top_20_perc), FALSE, top_20_perc))
# count first round appearance for all countries by year
all_gs_entries_by_country <- gs_entries_by_country_clean %>%
group_by(year, country) %>%
summarise(num_first_rd_year = sum(num_first_rd)) %>%
ungroup() %>%
left_join(distinct(gs_entries_by_country_clean, country,
year, top_20_perc)) %>%
group_by(year) %>%
mutate(perc_first_round = num_first_rd_year / sum(num_first_rd_year)) %>%
ungroup()
# output to check counts make sense for top 20 vs bottom 80
all_gs_entries_by_country %>%
group_by(year, top_20_perc) %>%
summarise(perc_first_round = sum(perc_first_round)) %>%
scrollable_table()| year | top_20_perc | perc_first_round |
|---|---|---|
| 1990 | FALSE | 0.2183575 |
| 1990 | TRUE | 0.7816425 |
| 1991 | FALSE | 0.2222222 |
| 1991 | TRUE | 0.7777778 |
| 1992 | FALSE | 0.2280702 |
| 1992 | TRUE | 0.7719298 |
| 1993 | FALSE | 0.2400389 |
| 1993 | TRUE | 0.7599611 |
| 1994 | FALSE | 0.2725509 |
| 1994 | TRUE | 0.7274491 |
| 1995 | FALSE | 0.2575316 |
| 1995 | TRUE | 0.7424684 |
| 1996 | FALSE | 0.2810078 |
| 1996 | TRUE | 0.7189922 |
| 1997 | FALSE | 0.3071705 |
| 1997 | TRUE | 0.6928295 |
| 1998 | FALSE | 0.3166023 |
| 1998 | TRUE | 0.6833977 |
| 1999 | FALSE | 0.3496638 |
| 1999 | TRUE | 0.6503362 |
| 2000 | FALSE | 0.3563107 |
| 2000 | TRUE | 0.6436893 |
| 2001 | FALSE | 0.3578337 |
| 2001 | TRUE | 0.6421663 |
| 2002 | FALSE | 0.3972868 |
| 2002 | TRUE | 0.6027132 |
| 2003 | FALSE | 0.4046467 |
| 2003 | TRUE | 0.5953533 |
| 2004 | FALSE | 0.4078695 |
| 2004 | TRUE | 0.5921305 |
| 2005 | FALSE | 0.4181818 |
| 2005 | TRUE | 0.5818182 |
| 2006 | FALSE | 0.4306358 |
| 2006 | TRUE | 0.5693642 |
| 2007 | FALSE | 0.4324324 |
| 2007 | TRUE | 0.5675676 |
| 2008 | FALSE | 0.4613900 |
| 2008 | TRUE | 0.5386100 |
| 2009 | FALSE | 0.4434698 |
| 2009 | TRUE | 0.5565302 |
| 2010 | FALSE | 0.4472656 |
| 2010 | TRUE | 0.5527344 |
| 2011 | FALSE | 0.4469328 |
| 2011 | TRUE | 0.5530672 |
| 2012 | FALSE | 0.4316406 |
| 2012 | TRUE | 0.5683594 |
| 2013 | FALSE | 0.4208984 |
| 2013 | TRUE | 0.5791016 |
| 2014 | FALSE | 0.4287109 |
| 2014 | TRUE | 0.5712891 |
| 2015 | FALSE | 0.4345703 |
| 2015 | TRUE | 0.5654297 |
| 2016 | FALSE | 0.4394531 |
| 2016 | TRUE | 0.5605469 |
| 2017 | FALSE | 0.4003906 |
| 2017 | TRUE | 0.5996094 |
| 2018 | FALSE | 0.3857422 |
| 2018 | TRUE | 0.6142578 |
| 2019 | FALSE | 0.3984375 |
| 2019 | TRUE | 0.6015625 |
# focus on top 20 entries
gs_entries_top_20_perc <- all_gs_entries_by_country %>%
filter(top_20_perc)
# identify implicitly missing data
grid <- expand_grid(year = unique(gs_entries_top_20_perc$year),
country = unique(gs_entries_top_20_perc$country))
# create variables needed for plotting
plot_df <- grid %>%
left_join(gs_entries_top_20_perc) %>%
mutate(perc_first_round = replace_na(perc_first_round, 0)) %>%
filter(year <= 2008) %>%
mutate(is_usa = country == "United States")
# calculate change in performance (per country)
# compares number of first round appearance 1990 to 2008
# for grouping, ordering and colouring countries in the area chart
change_df <- plot_df %>%
filter(year == max(year) | year == min(year)) %>%
select(year, country, perc_first_round) %>%
pivot_wider(names_from = year,
values_from = perc_first_round,
values_fill = 0) %>%
mutate(change = `2008` - `1990`,
fall = change < 0,
change_bin = cut(change,
breaks = c(-Inf, -0.1, -0.02, 0.02, 0.1, Inf))) %>%
arrange(change)
# add in change in performance to dataframe for plotting
plot_df_1 <- plot_df %>%
left_join(change_df) %>%
group_by(country) %>%
mutate(ave_num_first_round = mean(num_first_rd_year, na.rm = TRUE)) %>%
ungroup() %>%
arrange(fall, desc(change)) %>%
mutate(country = factor(country, levels = unique(country)))
# ----------------------------------------------------------------------------
# Produce the plot
# ----------------------------------------------------------------------------
p <- ggplot(plot_df_1) +
# core plot
geom_area(aes(year, perc_first_round,
group = country,
fill = change_bin),
colour = "grey80", size = 0.2) +
# colours
scale_fill_manual(values = c("#c94a54", "#efb3aa",
"#fffff0", "#aeabcd")) +
# format axis
scale_y_continuous(labels = scales::percent_format(accuracy = 1L),
breaks = c(seq(0,0.8,0.1)),
expand = expansion(mult = c(0, .1))) +
scale_x_continuous(expand = c(0,0),
breaks=c(1990,1995,2000,2005,2008)) +
# simplify plot for editting
labs(x = NULL, y = NULL) +
theme(legend.position = "none",
plot.margin = margin(25,25,25,25))
p# output plot for editting
ggsave("../images/image_5.svg", units = "mm", width = 525, height = 350)
# output plot as interactive for inspection
plotly::ggplotly(p)# ----------------------------------------------------------------------------
# Get the information needed to annotate the plot
# ----------------------------------------------------------------------------
levels(plot_df_1$change_bin)## [1] "(-Inf,-0.1]" "(-0.1,-0.02]" "(-0.02,0.02]" "(0.02,0.1]" "(0.1, Inf]"
# get data points for plot annotation
plot_df_1 %>%
filter(country %in% c("United States", "Australia", "Sweden")) %>%
group_by(year) %>%
summarise(perc_first_round_tot = sum(perc_first_round)) %>%
scrollable_table()| year | perc_first_round_tot |
|---|---|
| 1990 | 0.3951691 |
| 1991 | 0.3508772 |
| 1992 | 0.3255361 |
| 1993 | 0.3323615 |
| 1994 | 0.3064985 |
| 1995 | 0.2847425 |
| 1996 | 0.2635659 |
| 1997 | 0.2364341 |
| 1998 | 0.2326255 |
| 1999 | 0.2199808 |
| 2000 | 0.2165049 |
| 2001 | 0.1953578 |
| 2002 | 0.1841085 |
| 2003 | 0.1955470 |
| 2004 | 0.1794626 |
| 2005 | 0.1655502 |
| 2006 | 0.1512524 |
| 2007 | 0.1563707 |
| 2008 | 0.1389961 |
plot_df_1 %>%
filter(country == "United States") %>%
group_by(year) %>%
summarise(perc_first_round_tot = sum(perc_first_round)) %>%
scrollable_table()| year | perc_first_round_tot |
|---|---|
| 1990 | 0.2531401 |
| 1991 | 0.2261209 |
| 1992 | 0.2183236 |
| 1993 | 0.2157434 |
| 1994 | 0.1920466 |
| 1995 | 0.1827017 |
| 1996 | 0.1511628 |
| 1997 | 0.1395349 |
| 1998 | 0.1341699 |
| 1999 | 0.1325648 |
| 2000 | 0.1281553 |
| 2001 | 0.1179884 |
| 2002 | 0.1220930 |
| 2003 | 0.1316554 |
| 2004 | 0.1199616 |
| 2005 | 0.1090909 |
| 2006 | 0.1001927 |
| 2007 | 0.0994208 |
| 2008 | 0.0907336 |
# get players from a country in a given year (for annotation)
get_players <- function(country_str, year_int){
gs_first_round_gdp %>%
filter(country == country_str & year == year_int) %>%
distinct(name)
}
get_players("United States", 1990) %>%
scrollable_table()| name |
|---|
| Jim Pugh |
| Ivan Lendl |
| Tim Wilkison |
| Todd Witsken |
| Glenn Layendecker |
| Jimmy Brown |
| John McEnroe |
| Dan Goldie |
| Leif Shiras |
| Richey Reneberg |
| Ronald Agenor |
| Tim Mayotte |
| Pete Sampras |
| Aaron Krickstein |
| David Wheaton |
| Jimmy Arias |
| Jim Courier |
| Paul Chamberlin |
| Paul Annacone |
| Kelly Jones |
| Scott Davis |
| Lawson Duncan |
| Jay Berger |
| Andre Agassi |
| Michael Chang |
| Richard Matuszewski |
| Derrick Rostagno |
| Malivai Washington |
| Jim Grabb |
| Joey Rive |
| Bryan Shelton |
| David Pate |
| Michael Robertson |
| Brian Garrow |
| Brad Pearce |
| Rick Leach |
| Jeff Tarango |
| Kevin Curren |
| Brad Gilbert |
| Ken Flach |
| Martin Blackman |
| David Witt |
| Ivan Baron |
| Chris Garner |
| Tommy Ho |
| Patrick McEnroe |
| Robert Seguso |
| Steve Bryan |
| Todd Martin |
| Carrie Cunningham |
| Erika De Lone |
| Kimberly Kessaris |
| Shaun Stafford |
| Elise Burgin |
| Ronni Reis |
| Patty Fendick |
| Audra Keller |
| Stacey Martin |
| Pam Shriver |
| Lori Mcneil |
| Louise Allen |
| Jill Smoller |
| Mary Joe Fernandez |
| Robin White |
| Kathy Rinaldi Stunkel |
| Marianne Werdel Witmeyer |
| Ann Henricksson |
| Donna Faber |
| Andrea Leand |
| Camille Benjamin |
| Rosalyn Fairbank |
| Beverly Bowes |
| Heather Ludloff |
| Katrina Adams |
| Zina Garrison |
| Cammy Macgregor |
| Laxmi Poruri |
| Tami Whitlinger Jones |
| Terry Phelps |
| Gigi Fernandez |
| Halle Cioffi |
| Ann Grossman |
| Amy Frazier |
| Jennifer Santrock |
| Hu Na |
| Susan Sloane Lundy |
| Linda Wild |
| Betsy Nagelsen |
| Gretchen Magers |
| Jennifer Capriati |
| Monica Seles |
| Meredith Mcgrath |
| Kathy Jordan |
| Wendy White |
| Mareen Louie Harper |
| Mary Lou Piatek |
| Anna Ivan |
| Anne Smith |
| Martina Navratilova |
| Andrea Farley |
| Lisa Raymond |
| Chanda Rubin |
| Caroline Kuhlman |
| Sandy Collins |
| Eleni Rossides |
| Debbie Graham |
get_players("United States", 2008) %>%
scrollable_table()| name |
|---|
| Sam Querrey |
| Vincent Spadea |
| Donald Young |
| Robby Ginepri |
| Scoville Jenkins |
| Mardy Fish |
| James Blake |
| Bobby Reynolds |
| Wayne Odesnik |
| John Isner |
| Kevin Kim |
| Andy Roddick |
| Ryler Deheart |
| Brendan Evans |
| Ryan Sweeting |
| Robert Kendrick |
| Austin Krajicek |
| Sam Warburg |
| Michael Russell |
| Rajeev Ram |
| Julie Ditty |
| Jill Craybas |
| Lindsay Davenport |
| Ashley Harkleroad |
| Lilia Osterloh |
| Madison Brengle |
| Serena Williams |
| Venus Williams |
| Meilen Tu |
| Vania King |
| Laura Granville |
| Bethanie Mattek Sands |
| Melanie Oudin |
| Kristie Ahn |
| Gail Brodsky |
| Jamea Jackson |
| Alexa Glatch |
| Asia Muhammad |
| Ahsha Rolle |
| Shenay Perry |
| Coco Vandeweghe |
The mini resurgence of the richest countries
After 2008 there is an improvement in the performance of the richest countries. This chunk of code produces a plot that focusses in on this trend.
# ----------------------------------------------------------------------------
# Reshape data for plotting
# ----------------------------------------------------------------------------
# focus on 2009 onwards and find implicitly missing data
plot_df <- grid %>%
left_join(gs_entries_top_20_perc) %>%
mutate(perc_first_round = replace_na(perc_first_round, 0)) %>%
filter(year > 2008)
# calculate change in performance of coutries between 2009 and 2019
change_df <- plot_df %>%
filter(year == max(year) | year == min(year)) %>%
select(year, country, perc_first_round) %>%
pivot_wider(names_from = year,
values_from = perc_first_round,
values_fill = 0) %>%
mutate(change = `2019` - `2009`,
fall = change < 0,
# create bins for grouping countries
change_bin = cut(change,
breaks = c(-Inf, -0.1, -0.02, 0.02, 0.1, Inf))) %>%
arrange(change)
# define countries of particular interest for in plot
countries_of_int <- c("United States", "France", "Sweden", "Australia", "Spain")
# group all the other (not of specific interest) countries
plot_df_1 <- plot_df %>%
left_join(change_df) %>%
mutate(country = if_else(country %in% countries_of_int,
country,
"Other")) %>%
group_by(country, year) %>%
summarise(perc_first_round = sum(perc_first_round)) %>%
ungroup()
# define order of the facets within the plot
facet_order <- c("United States", "Australia", "Sweden", "France", "Spain", "Other")
# create annotation layer for plot
annotation_df <- plot_df_1 %>%
mutate(label = round(perc_first_round * 100, 1),
num_appearances = round(perc_first_round * 256)) %>%
filter(year == max(plot_df_1$year)|
year == min(plot_df_1$year))
# ----------------------------------------------------------------------------
# Produce the plot
# ----------------------------------------------------------------------------
p <- ggplot(plot_df_1,
aes(year, perc_first_round)) +
# core plot
geom_area(aes(group = country),
colour = "grey80", size = 0.2) +
# annotation
ggrepel::geom_text_repel(data = annotation_df,
mapping = aes(label = num_appearances)) +
# create small multiples
facet_wrap(~factor(country, levels = facet_order))
pcountry affects
# ----------------------------------------------------------------------------
# Reshape data for plotting
# ----------------------------------------------------------------------------
gs_first_round_gdp %>%
count(year, country, iso) %>%
group_by(year) %>%
mutate(perc_appear = n / sum(n)) %>%
filter(iso %in% c("USA", "FRA", "ESP")) %>%
# --------------------------------------------------------------------------
# Produce the plot
# --------------------------------------------------------------------------
ggplot() +
geom_line(aes(year, perc_appear, colour = country))# ----------------------------------------------------------------------------
# Reshape data for plotting
# ----------------------------------------------------------------------------
country_group_counts <- gs_first_round_gdp %>%
count(year, country, iso) %>%
mutate(colour = if_else(
iso %in% c("USA", "FRA", "ESP"), iso, "other"
))
# ----------------------------------------------------------------------------
# Produce the plot
# ----------------------------------------------------------------------------
country_group_counts %>%
ggplot() +
geom_line(aes(year, n, group = country, colour = colour)) +
scale_y_continuous(expand = c(0,0)) +
scale_x_continuous(expand = c(0,0),
breaks=c(1990,1995,2000,2005,2010,2015,2019)) +
coord_cartesian(clip = 'off') +
labs(x = NULL, y = NULL) +
theme(legend.position = "none",
plot.margin = margin(25,25,25,25),
panel.grid.minor = element_blank()) ggsave("../images/image_7.svg", units = "mm", width = 525, height = 350)
country_group_counts %>%
filter(iso == "USA")## # A tibble: 30 × 5
## year country iso n colour
## <dbl> <chr> <chr> <int> <chr>
## 1 1990 United States USA 262 USA
## 2 1991 United States USA 232 USA
## 3 1992 United States USA 224 USA
## 4 1993 United States USA 222 USA
## 5 1994 United States USA 198 USA
## 6 1995 United States USA 188 USA
## 7 1996 United States USA 156 USA
## 8 1997 United States USA 144 USA
## 9 1998 United States USA 139 USA
## 10 1999 United States USA 138 USA
## # … with 20 more rows
top_3 <- gs_first_round_gdp %>%
count(year, country) %>%
arrange(year, desc(n)) %>%
group_by(year) %>%
mutate(rank = rank(-n)) %>%
ungroup() %>%
filter(rank == 1 | rank == 2 | rank == 3)Countries outside top twenty percent
# ----------------------------------------------------------------------------
# Reshape data for plotting
# ----------------------------------------------------------------------------
gs_entries_the_other_80 <- gs_entries_by_country %>%
filter(!top_20_perc) %>%
count(year, country, wt = num_first_rd)
grid <- expand_grid(country = unique(gs_entries_the_other_80$country),
year = unique(gs_entries_the_other_80$year))
plot_df <- grid %>%
left_join(gs_entries_the_other_80) %>%
mutate(n = replace_na(n, 0))
# ----------------------------------------------------------------------------
# Produce the plots
# ----------------------------------------------------------------------------
p <- ggplot(plot_df,
aes(year, n, group = country)) +
geom_line()
pplotly::ggplotly(p)p <- ggplot(plot_df,
aes(year, n, fill = country)) +
geom_area()
pplotly::ggplotly(p)gs_entries_by_country %>%
filter(str_detect(str_to_lower(country), "cz"))## # A tibble: 120 × 8
## year tourney_name country_code country gdp_per_…¹ num_f…² incom…³ top_2…⁴
## <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <lgl>
## 1 1990 Australian Open CZE Czechia 23585. 10 8 FALSE
## 2 1990 Roland Garros CZE Czechia 23585. 11 8 FALSE
## 3 1990 US Open CZE Czechia 23585. 10 8 FALSE
## 4 1990 Wimbledon CZE Czechia 23585. 10 8 FALSE
## 5 1991 Australian Open CZE Czechia 20896. 11 8 FALSE
## 6 1991 Roland Garros CZE Czechia 20896. 10 8 FALSE
## 7 1991 US Open CZE Czechia 20896. 8 8 FALSE
## 8 1991 Wimbledon CZE Czechia 20896. 9 8 FALSE
## 9 1992 Australian Open CZE Czechia 20769. 9 8 FALSE
## 10 1992 Roland Garros CZE Czechia 20769. 7 8 FALSE
## # … with 110 more rows, and abbreviated variable names ¹gdp_per_capita,
## # ²num_first_rd, ³income_decile, ⁴top_20_perc
# ----------------------------------------------------------------------------
# Produce the plot
# ----------------------------------------------------------------------------
countries_of_int <- c("Russia", "Argentina", "Czechia")
plot_df %>%
mutate(colour = if_else(country %in% countries_of_int,
country, "other")) %>%
filter(!(country == "Czechia" & year >= 2017)) %>%
ggplot(aes(year, n,
colour = colour,
group = country)) +
geom_line() +
scale_colour_manual(values = c("#A7BCD6", "#35469D", "grey95", "#C94A54")) +
scale_y_continuous(expand = c(0,0)) +
scale_x_continuous(expand = c(0,0),
breaks=c(1990,1995,2000,2005,2010,2015,2019)) +
coord_cartesian(clip = 'off') +
labs(x = NULL, y = NULL) +
theme(legend.position = "none",
plot.margin = margin(25,25,25,25),
panel.grid.minor = element_blank()) ggsave("../images/image_8.svg", units = "mm", width = 525, height = 350)Bottom 50 percent
# ----------------------------------------------------------------------------
# Reshape data for plotting
# ----------------------------------------------------------------------------
bottom_50_perc <- gs_entries_by_country %>%
filter(income_decile <= 5)
# num countries with first round appearances
bottom_50_perc %>%
distinct(year, country) %>%
count(year) %>%
ggplot() +
geom_col(aes(year, n), width = 0.8) +
scale_y_continuous(expand = expansion(mult = c(0, .1))) +
scale_x_continuous(expand = c(0,0),
breaks=c(1990,1995,2000,2005,2010,2015,2019)) +
coord_cartesian(clip = 'off') +
labs(x = NULL, y = NULL) +
theme(legend.position = "none",
plot.margin = margin(25,25,25,25),
panel.grid.minor = element_blank()) ggsave("../images/image_9.svg", units = "mm", width = 525, height = 350)
# for annotation
bottom_50_perc %>%
distinct(year, country) %>%
filter(year == 1990 | year == 2013)## # A tibble: 7 × 2
## year country
## <dbl> <chr>
## 1 1990 India
## 2 1990 Nigeria
## 3 1990 Peru
## 4 2013 China
## 5 2013 Uzbekistan
## 6 2013 India
## 7 2013 Georgia
# countries from bottom 50 percent with most appearances in first round 1990 - 2019
top_n <- 5
top_n_countries <- bottom_50_perc %>%
count(country) %>%
slice_max(order_by = n, n = top_n) %>%
.$country
bottom_50_perc %>%
distinct(year,country) %>%
filter(year == max(year)) %>%
.$country## [1] "Ukraine" "South Africa" "Tunisia" "India" "Uzbekistan"
## [6] "Moldova" "Bolivia"
plot_df <- bottom_50_perc %>%
count(year, country) %>%
mutate(country = if_else(country %in% top_n_countries,
country,
"Other")) %>%
group_by(year, country) %>%
summarise(n = sum(n))
# ----------------------------------------------------------------------------
# Produce the plot
# ----------------------------------------------------------------------------
ggplot(plot_df) +
geom_col(aes(year, n, fill = country))# for story text
bottom_50_perc %>%
distinct(year, country) %>%
filter(year == 2003)## # A tibble: 17 × 2
## year country
## <dbl> <chr>
## 1 2003 Belarus
## 2 2003 Morocco
## 3 2003 Indonesia
## 4 2003 Philippines
## 5 2003 Peru
## 6 2003 Armenia
## 7 2003 Ecuador
## 8 2003 Georgia
## 9 2003 Paraguay
## 10 2003 Uzbekistan
## 11 2003 Zimbabwe
## 12 2003 Colombia
## 13 2003 Madagascar
## 14 2003 Ukraine
## 15 2003 Bosnia and Herzegovina
## 16 2003 China
## 17 2003 Tunisia
# ----------------------------------------------------------------------------
# Reshape data for plotting
# ----------------------------------------------------------------------------
bottom_50_perc_country_counts <- bottom_50_perc %>%
count(year, country, wt = num_first_rd) %>%
group_by(country) %>%
mutate(country_ave_n = mean(n, na.rm = TRUE)) %>%
ungroup()
# look at how many times the countries appear in the bottom fifty percent
bottom_50_summary <- bottom_50_perc_country_counts %>%
count(country) %>%
rename(total_n = n) %>%
left_join(distinct(
bottom_50_perc_country_counts,
country,
country_ave_n
)) %>%
arrange(desc(country_ave_n))
# bottom_50_perc_country_counts <- bottom_50_perc_country_counts %>%
# left_join(bottom_50_summary)
# create a grid to see where countries have moved out of bottom 50 perc
grid <- expand_grid(year = unique(bottom_50_perc_country_counts$year),
country = unique(bottom_50_perc_country_counts$country))
plot_df <- grid %>%
left_join(bottom_50_perc_country_counts)
# deciles by year
top_50_countries_by_year <- gs_entries_by_country %>%
distinct(year, country, income_decile) %>%
filter(income_decile > 5)
check_top_50 <- function(year, country){
selector <- top_50_countries_by_year$year == year &
top_50_countries_by_year$country == country
res <- top_50_countries_by_year[selector, ]
if(nrow(res) == 0){
return(FALSE)
}
else {
return(res[[1,"income_decile"]] > 5)
}
}
check_top_50(1991, "United States")## [1] TRUE
plot_df_1 <- plot_df %>%
mutate(in_top_50 = map2_lgl(year, country, ~check_top_50(.x,.y)),
n = if_else(is.na(n) & in_top_50,
-1, n),
n = replace_na(n, 0),
bin_n = cut(n, breaks = c(-Inf, -1e-10,0,1e10, 5, 10, 20, Inf)))
# for ordering exploratory plot
levels <- rev(bottom_50_summary$country)
# ----------------------------------------------------------------------------
# Produce the plot
# ----------------------------------------------------------------------------
ggplot(plot_df_1) +
geom_tile(aes(x = year,
y = factor(country, levels = levels),
fill = bin_n),
colour = "#E7E4E5") +
labs(x= NULL, y = NULL) +
coord_equal() +
# https://gka.github.io/palettes/#/5|s|ffffff,35469d|ffffe0,ff005e,93003a|1|1
scale_fill_manual(values = c("#E7E4E5", "white", '#cfcde7', '#9f9dce', '#6e70b6', '#35469d')) +
#scale_fill_gradient(low = "#D6DCE0", high = "#000DA8") +
theme_minimal() +
theme()ggsave("test_out.svg")library(ggridges)
# ----------------------------------------------------------------------------
# Reshape data for plotting
# ----------------------------------------------------------------------------
bottom_50_countries <- bottom_50_summary %>%
slice_head(n = 10) %>%
.$country
# ----------------------------------------------------------------------------
# Produce the plot
# ----------------------------------------------------------------------------
gs_first_round_gdp %>%
filter(country %in% bottom_50_countries) %>%
ggplot() +
ggridges::geom_density_ridges(mapping = aes(year,
factor(country, levels = levels),
height = stat(density)),
stat = "density")